DETERMINING SIGNAL TRANSDUCTION PATHWAYS 



This application claims the benefit of copending provisional application
no. 60/106,912 filed November 3, 1998, which is expressly incorporated by
reference in its entirety herein.

TECHNICAL FIELD OF THE INVENTION 

This invention is related to functional mediators of genes and to the functional
organization of such mediators into pathways.
BACKGROUND OF THE INVENTION 

Many biological functions are accomplished by altering the expression 
of various genes through transcriptional (e.g. through control of initiation, 
provision of RNA precursors, RNA processing, etc.) and/or translational 
control. For example, fundamental biological processes such as cell cycle 
regulation, cell differentiation and cell death, are often characterized by the 
variations in the expression levels of groups of genes. 

Gene expression is also associated with pathogenesis. For example, the 
lack of sufficient expression of functional tumor suppressor genes and/or the
overexpression of oncogenes/protooncogenes could lead to tumorigenesis
(Marshall, Cell, 64: 313-326 (1991); Weinberg, Science, 254: 1138-1146 
(1991), incorporated herein by reference for all purposes). Thus, changes in 
the expression levels of particular genes (e.g. oncogenes or tumor suppressors) 
serve as signposts for the presence and progression of various diseases. 

The study of gene expression in the art has been generally concentrated 
on the regulatory regions of the gene of interest and on the relationships among
a few genes. A number of transcription factors/DNA binding proteins have
been identified and a limited number of regulatory pathways have been
discovered. However, the expression of a particular gene is frequently
regulated by the expression of a large number of other genes. The expression
of those regulatory genes may also be under the control of additional genes.

This complex regulatory relationship among genes constitutes a genetic 
network. The function and regulation of a particular gene can be best 
understood in the context of this genetic network. As the Human Genome 
Project and commercial genome research progress at a great rate, most, if not 
all, of the expressed genes will be partially sequenced in the near future.

Understanding the functions and regulatory relationships among the large 
number of genes is becoming a difficult task with traditional tools. 

Communication through signal transduction pathways is usually 
achieved through a combination of transcriptional and post-transcriptional 
mechanisms. Because signaling events are causally linked within given
pathways, genetic alterations that disrupt either of these two classes of
regulatory mechanisms should ultimately affect gene expression. 

There is a need in the art to develop a systematic approach to 
understand the complex regulatory relationships among large numbers of 
genes, in particular those involved in signal transduction.

SUMMARY OF THE INVENTION 

It is an object of the present invention to provide methods of 
determining candidate functional mediators of genes. 

It is another object of the present invention to provide methods for 
determining a pathway of gene products.

These and other objects of the invention are provided by one
or more of the embodiments described below. In one embodiment a method
is provided for determining candidate functional mediators of a transgene. The
method comprises altering expression of a first selected gene in a first of two
populations of identical cells. Expression levels of a set of genes in the two
populations of cells are compared. Genes in the set whose expression levels
differ between the two populations of cells are identified. The genes identified
are candidate functional mediators of the selected gene. Expression of a 
second selected gene is then altered in one of a third and fourth populations of 
cells. The third and fourth populations comprise identical cells. The second 
selected gene is a candidate functional mediator of the first selected gene. 
Expression levels of a set of genes in the third and fourth populations of cells 
are compared. Genes in the set whose expression levels differ between the 
third and fourth populations of cells are identified. The genes identified are 
candidate functional mediators of the second selected gene. 
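
The comparison step of this embodiment can be illustrated with a minimal sketch. The function name `candidate_mediators`, the gene names, the expression values, and the two-fold cutoff `min_fold` are all assumptions made for illustration; the text above does not specify how large a difference must be before a gene is treated as differing between the two populations.

```python
from typing import Dict, Set


def candidate_mediators(treated: Dict[str, float],
                        control: Dict[str, float],
                        min_fold: float = 2.0) -> Set[str]:
    """Return genes whose expression differs between two cell populations.

    `min_fold` is a hypothetical cutoff, not a value taken from the text.
    """
    hits: Set[str] = set()
    # Only genes measured in both populations can be compared.
    for gene in treated.keys() & control.keys():
        a, b = treated[gene], control[gene]
        if a <= 0 or b <= 0:
            continue  # skip genes without a usable measurement
        ratio = a / b
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            hits.add(gene)
    return hits


# Invented example values, for illustration only.
control = {"geneA": 100.0, "geneB": 50.0, "geneC": 80.0}
treated = {"geneA": 310.0, "geneB": 52.0, "geneC": 20.0}
cfm = candidate_mediators(treated, control)
```

Each gene returned could then serve as the second selected gene in a further pair of populations, as the embodiment describes.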

According to another embodiment a method is provided for identifying 
pathways of functional mediators of a selected gene. The method comprises 
altering expression of a first selected gene in a first of two populations of 
identical cells. Expression levels of a set of genes in the two populations of 
cells are compared. Genes in the set whose expression levels differ between 
the two populations of cells are identified. The genes identified form a set of 
candidate functional mediators of the first selected gene. Expression of a 
second selected gene is altered in one of a third and fourth populations of cells 
which populations comprise identical cells. Expression levels of the set of 
genes in the third and fourth populations of cells are compared. Genes in the 
set whose expression levels differ between the third and fourth populations of 
cells are identified. The genes identified form a set of candidate functional 
mediators of the second selected gene. The candidate functional mediators 
identified of the first and second selected genes are compared. Genes which 
are identified as candidate functional mediators of both selected genes suggest 
that the first and second selected genes are components of a common pathway. 
The absence of genes identified as candidate functional mediators of both
selected genes suggests that the two selected genes are in different pathways. Identification
of the set of candidate functional mediators of the first selected gene as a 
subset of the set of candidate functional mediators of the second selected gene 
suggests that the first selected gene is downstream in a pathway relative to the 
second selected gene. A candidate functional mediator which is identified in
only one of the two sets of candidate functional mediators is identified as
upstream in the pathway of a selected gene from a point of convergence with
the pathway of the other selected gene, if the sets of candidate functional
mediator genes of the first and second selected genes contain common
members.
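
The set comparisons described in this embodiment map directly onto set operations. The following is a sketch only; the function name `compare_mediator_sets` and the result keys are hypothetical, and the requirement that the first set be non-empty before it is called a subset is an added assumption.

```python
from typing import Dict, Set


def compare_mediator_sets(first: Set[str], second: Set[str]) -> Dict[str, object]:
    """Apply the set-comparison rules to two sets of candidate functional mediators."""
    common = first & second
    result: Dict[str, object] = {
        # Shared mediators suggest a common pathway; none suggests different pathways.
        "common_pathway": bool(common),
        # First set contained in the second suggests the first gene is downstream.
        # Requiring a non-empty first set is an assumption of this sketch.
        "first_downstream_of_second": bool(first) and first <= second,
        "upstream_of_convergence_first": set(),
        "upstream_of_convergence_second": set(),
    }
    if common:
        # Mediators seen in only one set lie upstream of the point of convergence.
        result["upstream_of_convergence_first"] = first - second
        result["upstream_of_convergence_second"] = second - first
    return result


r = compare_mediator_sets({"g1", "g2"}, {"g1", "g2", "g3"})
r2 = compare_mediator_sets({"a"}, {"b"})
```

Note that when the two sets are identical, the subset rule holds in both directions, so the relative order of the two selected genes cannot be resolved from these sets alone.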

In still another aspect of the invention, a method is provided to 
determine a pathway of gene products. The method comprises comparing a 
first set of genes with a second set of genes. The first set is identified by 
comparing a first set of candidate functional mediator genes with a second set 
of candidate functional mediator genes. The first set is identified by the
process of:

(a) comparing expression levels of a set of genes in 
two populations of identical cells, wherein a first of the 
two populations of cells has been treated to alter
expression of a first selected gene;

(b) identifying genes in the set whose expression levels 
differ between the two populations of cells, wherein the 
genes identified are candidate functional mediators of 
the first selected gene. 

The second set is identified by the process of:

(c) comparing expression levels of the set of genes in 
a third and fourth population of cells, wherein the third 
population of cells has been treated to alter expression 
of a second selected gene; 

(d) identifying genes in the set whose expression levels
differ between the third and fourth populations of identical
cells, wherein the genes identified are candidate 
functional mediators of the second selected gene. 
The first and second selected genes are identified as components of a common
pathway when one or more genes are found to be candidate functional 
mediators of both of said first and said second selected genes. Alternatively, 
the first and second selected genes are identified as being in different pathways 
when no gene is identified as being a candidate functional mediator of both of 
said first and said second selected genes. In another embodiment a gene which 
is found to be a candidate functional mediator of only one of said first and said 
second selected genes is identified as upstream in the pathway of the first or 
second selected gene from a point of convergence with the pathway of the 
second or first selected gene, if the first and second sets of candidate functional 
mediator genes contain common members. In still another embodiment the 
first selected gene is identified as downstream in a pathway relative to the 
second selected gene if the set of candidate functional mediators of the first 
selected gene is a subset of the set of candidate functional mediators of the 
second selected gene. 

These approaches can be used to interrogate the genetic regulatory 
network and to construct a map indicating regulatory relationships. 
BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 displays proposed mechanistic advantages of BRCA1 inactivation. 
Genes A to H display reproducibly altered expression patterns following
BRCA1 induction. Arrows pointing up indicate increases in expression, and 
arrows pointing down indicate decreases in expression. The directionality of 
these arrows is opposite to that of the expression changes observed following
BRCA1 induction; they are meant to indicate the putative effects of BRCA1 
inactivation. 

Figure 2 ... with high reliability. Many graphs are compatible with the data, but all such
graphs are subgraphs of the transitive closure graph G*.
Figures 3A through 3C demonstrate that interactions not at the level of 
expression are concealed in the connectivity graph. Genes B* (regulated by 
A), E (not regulated by A), and A are indistinguishable at the expression level.



DETAILED DESCRIPTION 

The present invention is directed to the systematic analysis of 
relationships between expression patterns of genes affected by other genes. 
This approach represents a paradigm shift away from research efforts focusing 
on single genes in isolation and towards integrated analyses of multiple-gene
networks. While the effect of any gene on any other gene can be studied, the
methods are particularly useful for analyzing the effects of tumor suppressor 
genes and oncogenes. 

Any means known in the art can be utilized for altering the expression
of putative effector genes. In one exemplary means a cell which is null for the
putative effector gene is compared to an isogenic cell which is wild-type for the
effector gene. In another exemplary means cells are used which contain a
transgene under the control of an exogenously regulated promoter. Two
populations of the exogenously regulatable cells are compared: cells which are
treated so that the transgene is expressed, and cells which are treated so that
the transgene is not expressed. Other means for altering expression of a
putative effector gene include mutagenizing the putative effector gene,
administering antisense oligonucleotides or antisense-expressing constructs to
cells to inhibit translation of the mRNA of the putative effector gene,
administering ribozymes or ribozyme-expressing constructs to cells to inhibit
translation of the mRNA of the putative effector gene, and altering copy
number of the putative effector gene. Techniques for accomplishing these 
means of altered expression are well known in the art and any can be used as 
is desirable and convenient. 

The methods disclosed herein model the networks of genes as
discrete states. This is a fundamentally computational and combinatorial
approach, which explicitly deals with interactions within a network and requires
no time-series data. Since we are making single, simple modifications to genes,
and can reliably detect changes in thousands of putative effected genes using
only a few experiments, our methodology is combinatorial in nature. This
permits us to trace the flow of signal information down pathways, and hence
our problem is tractable with relatively few data collection events. 
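
One way to picture a discrete-state representation is to reduce each gene's response to a single perturbation to one of three values. The sketch below assumes three states (increased, decreased, unchanged) and a hypothetical two-fold cutoff `min_fold`; neither the number of states nor the cutoff is specified in the text, and the function name `discretize_changes` is illustrative.

```python
from typing import Dict


def discretize_changes(treated: Dict[str, float],
                       control: Dict[str, float],
                       min_fold: float = 2.0) -> Dict[str, int]:
    """Map each gene to +1 (increased), -1 (decreased), or 0 (unchanged).

    The three-state scheme and `min_fold` are assumptions of this sketch.
    """
    states: Dict[str, int] = {}
    for gene in treated.keys() & control.keys():
        a, b = treated[gene], control[gene]
        if a <= 0 or b <= 0:
            continue  # no usable measurement for this gene
        ratio = a / b
        if ratio >= min_fold:
            states[gene] = 1
        elif ratio <= 1.0 / min_fold:
            states[gene] = -1
        else:
            states[gene] = 0
    return states


# Invented values, for illustration only.
states = discretize_changes(
    {"g1": 40.0, "g2": 10.0, "g3": 5.0},
    {"g1": 10.0, "g2": 11.0, "g3": 20.0},
)
```

Because each experiment yields one such state vector and no time series is needed, a small number of single-gene perturbations can be compared combinatorially.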

Cell lines harboring single alterations in genes known or unknown to 
reside in common pathways can be used. Pathway modeling algorithms 
recognize links among the products of these genes and order these products 
within regulatory cascades. Model regulatory systems which involve genes that 
act in pathways targeted for mutation during tumorigenesis can be used, for 
example. Specifically, we can compare the expression profiles of cell lines that 
differ from one another only by 1) the introduction of a single inducible 
transgene, or by 2) the deletion of a single endogenous gene. As discussed 
above, other alterations can also be used to effect expression of a single gene. 
Any attendant expression changes can be attributed to a unique genetic change, 
i.e., to the alteration of the single gene. Expression profiles for thousands of 
genes can be obtained simultaneously by hybridizing labeled RNA (or derived 
cDNA) from these cell lines to high-density oligonucleotide arrays. Other 
methods known in the art for obtaining expression data of multiple genes can 
also be used, including the Serial Analysis of Gene Expression technique. See 
U.S. Patent No. 5,695,937. 

Transgenes can be introduced into cells or animals. The cells or 
animals may or may not lack those genes. For example, a cell which lacks p53 
can be supplied with a p53 transgene from another cell or animal. 
Alternatively, additional gene copy number can be achieved by introducing an 
additional copy of a gene to a cell or animal which contains the gene. Thus 
transgenes need not be from other species, although they can be. Cells which 
are lacking the genes can be naturally so or experimentally induced. The "lack" 
another gene product).

Exogenously regulatable promoters can be used to alter expression of 
the effector genes; these can be inducible or repressible. The effector of 
regulation can be a chemical, e.g., a hormone or drug, or another agent such 
as γ-irradiation which can be controlled experimentally. Treatment to permit
transcription or to permit no transcription can be active or passive. Thus not
applying an agent can constitute "treatment" to achieve a certain transcriptional 
state. 

Mutagenesis can be used to alter expression of a selected gene. Any 
means known in the art can be used, although controlled methods are preferred 
to eliminate the possibility of multiple mutations, especially those in other 
genes. 

Antisense constructs or oligonucleotides and ribozyme constructs or 
oligonucleotides can also be used to alter expression of a selected gene. 
Methods for making and administering these agents are also well known, and 
any such methods can be used in the context of the present invention for 
altering expression. 

Determining expression levels can be done by observing, quantitatively 
or qualitatively, expression of a gene product. The gene product can be mRNA 
or protein. The actual gene product can be observed or some derivative, such 
as cDNA. Protein can be observed by any means known in the art, including 
immunological methods and enzyme assays. Any method for observing gene 
expression can be used, without limitation. Absolute measurements of the 
expression levels need not be made, although they can be made. Comparison
of differences in expression levels is, however, required. A preferred method
utilizes the simultaneous analysis of expression of multiple genes, such as using 
the Serial Analysis of Gene Expression (SAGE) method or using hybridization 
to DNA arrays. 

Comparison of expression levels can be done visually or manually, or 
may be automated and done by a machine, using for example optical detection
means. Hardware and software for analyzing differential expression of genes
are available and may be used in practicing the present invention.

Once a candidate functional mediator (CFM) is identified, it can be used 
in an iterative fashion as an effector gene to determine the effect that it has on 
other (downstream) genes. The CFM becomes a selected gene which is the 
target of expression altering treatment. Thus, for example, exogenously 
regulatable constructs can be made using the CFM, or mutant cells null for the
CFM can be used. This can be done repeatedly to "work down the pathway", 
or it can be done for multiple CFMs identified to expand into branches of the 
pathway. 
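
The iterative use of CFMs as new effector genes can be sketched as a breadth-first traversal. Here `find_mediators` is a hypothetical callback standing in for an entire wet-lab experiment (alter expression of the given gene, then compare expression profiles); `max_rounds` is an assumed limit on how many levels are explored, and the toy results are invented.

```python
from collections import deque
from typing import Callable, Dict, Set


def work_down_pathway(start: str,
                      find_mediators: Callable[[str], Set[str]],
                      max_rounds: int = 3) -> Dict[str, Set[str]]:
    """Repeatedly treat each newly found CFM as the next selected gene."""
    results: Dict[str, Set[str]] = {}
    queue = deque([(start, 0)])
    while queue:
        gene, depth = queue.popleft()
        if gene in results or depth >= max_rounds:
            continue  # already tested, or beyond the allowed number of rounds
        mediators = find_mediators(gene)  # one experiment per selected gene
        results[gene] = mediators
        for m in sorted(mediators):
            queue.append((m, depth + 1))  # each CFM becomes a selected gene
    return results


# Invented experiment outcomes, for illustration only.
toy = {"geneX": {"geneA", "geneB"}, "geneA": {"geneC"}, "geneB": set(), "geneC": set()}
found = work_down_pathway("geneX", lambda g: toy.get(g, set()), max_rounds=2)
```

Following a single CFM at each step corresponds to working down the pathway; following all of them, as this sketch does, expands into its branches.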

Any gene can be used in the present invention as a selected gene (an
effector gene), e.g., as a transgene or as a mutated null gene. Tumor genes or
oncogenes are particularly interesting, but the invention is not limited to the
type of gene used. The methods do not rely on function of the gene product,
thus the function is not a limitation.

The methods of the invention permit the ordering of effector genes
relative to each other in a pathway. They also permit the identification of and
ordering of candidate functional mediator genes in a pathway. Comparing sets
of effected genes and finding overlaps in the sets and non-overlaps in the sets
permits the reconstruction of pathways. The pathways indicate which gene
products influence the expression of other gene products.
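
Because altering one gene changes the expression of every gene downstream of it, whether the influence is direct or indirect, the observed relationships resemble a transitive closure, which is consistent with the reference to the transitive closure graph G* in the description of the drawings. The sketch below computes such a closure from a set of direct edges; the function name and the example edges are hypothetical, and the simple fixed-point iteration is chosen for clarity rather than efficiency.

```python
from typing import Dict, Set


def transitive_closure(edges: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, for each gene, every gene reachable through directed influence edges."""
    nodes = set(edges) | {t for targets in edges.values() for t in targets}
    closure = {n: set(edges.get(n, set())) for n in nodes}
    changed = True
    while changed:  # repeat until no new reachable gene is added
        changed = False
        for n in nodes:
            reachable = set(closure[n])
            for m in closure[n]:
                reachable |= closure[m]
            if reachable != closure[n]:
                closure[n] = reachable
                changed = True
    return closure


# Invented direct edges: X influences Y, and Y influences Z.
g_star = transitive_closure({"X": {"Y"}, "Y": {"Z"}})
```

In this toy case an experiment altering X would report both Y and Z as affected, so the direct chain X, Y, Z and the closure are both compatible with the expression data.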

This invention provides methods for interrogating the genetic network. 
The methods involve quantifying the level of expression of a large number of 
genes. In some preferred embodiments, a high density oligonucleotide array
is used to hybridize with a target nucleic acid sample to detect the expression
level of a large number of genes, preferably more than 10, more preferably
more than 100, and most preferably more than 1000 genes.

Activity of a gene is reflected by the activity of its product(s): the 
proteins or other molecules encoded by the gene. Those product molecules 
perform biological functions. Directly measuring the activity of a gene product
is, however, often difficult for certain genes. Instead, the immunological
activities or the amount of the final product(s) or its peptide processing
intermediates are determined as a measurement of the gene activity. More
frequently, the amount or activity of intermediates, such as transcripts, RNA
processing intermediates, or mature mRNAs are detected as a measurement of
gene activity.

In many cases, the form and function of the final product(s) of a gene
is unknown. In those cases, the activity of a gene is measured conveniently by 
the amount or activity of transcript(s), RNA processing intermediate(s), mature 
mRNA(s) or its protein product(s) or functional activity of its protein 
product(s).

Any methods that measure the activity of a gene are useful for at least 
some embodiments of this invention. For example, traditional Northern 
blotting and hybridization, nuclease protection, RT-PCR and differential
display have been used for detecting gene activity. Those methods are useful
for some embodiments of the invention. However, this invention is most useful
in conjunction with methods for detecting the expression of a large number of
genes. 

High density arrays are particularly useful for monitoring the 
expression control at the transcriptional, RNA processing and degradation
level. The fabrication and application of high density arrays in gene expression
monitoring have been disclosed previously in, for example, WO 97/10365, WO
92/10588, U.S. Application Ser. No. 08/772,376 filed December 23, 1996;
serial number 08/529,115 filed on September 15, 1995; serial number
08/168,904 filed December 15, 1993; serial number 07/624,114 filed on
December 6, 1990; serial number 07/362,901 filed June 7, 1990, all
incorporated herein for all purposes by reference. In some embodiments using
high density arrays, high density oligonucleotide arrays are synthesized using
methods such as the Very Large Scale Immobilized Polymer Synthesis
(VLSIPS) disclosed in U.S. Pat. No. 5,445,934 incorporated herein for all
purposes by reference. Each oligonucleotide occupies a known location on a
substrate. A nucleic acid target sample is hybridized with a high density array
of oligonucleotides and then the amount of target nucleic acids hybridized to
each probe in the array is quantified. One preferred quantifying method is to
use a confocal microscope and fluorescent labels. The GeneChip® system
(Affymetrix, Santa Clara, CA) is particularly suitable for quantifying the
hybridization; however, it will be apparent to those of skill in the art that any
similar systems or other effectively equivalent detection methods can also be
used. 

High density arrays are suitable for quantifying small variations in
expression levels of a gene in the presence of a large population of 
heterogeneous nucleic acids. Such high density arrays can be fabricated either 
by de novo synthesis on a substrate or by spotting or transporting nucleic acid 
sequences onto specific locations of substrate. Nucleic acids are purified 
and/or isolated from biological materials, such as a bacterial plasmid containing 
a cloned segment of sequence of interest. Suitable nucleic acids are also 
produced by amplification of templates. As a nonlimiting illustration, 
polymerase chain reaction, and/or in vitro transcription, are suitable nucleic 
acid amplification methods. 

Synthesized oligonucleotide arrays are particularly preferred for this 
invention. Oligonucleotide arrays have numerous advantages, as opposed to 
other methods, such as efficiency of production, reduced intra- and inter array 
variability, increased information content and high signal-to-noise ratio. 

Preferred high density arrays for gene function identification and 
genetic network mapping comprise greater than about 100, preferably greater 
than about 1000, more preferably greater than about 16,000 and most 
preferably greater than 65,000 or 250,000 or even greater than about 
1,000,000 different oligonucleotide probes, preferably in less than 1 cm² of
surface area. The oligonucleotide probes range from about 5 to about 50 or
about 500 nucleotides, more preferably from about 10 to about 40 nucleotides
and most preferably from about 15 to about 40 nucleotides in length. 
Massive Parallel Gene Expression Monitoring 

One preferred method for massive parallel gene expression monitoring is
based upon high density nucleic acid arrays. Nucleic acid array methods for
monitoring gene expression are disclosed and discussed in detail in PCT
Application WO 92/10588 (published on June 25, 1992), all incorporated
herein by reference for all purposes.

Generally those methods of monitoring gene expression involve (a)
providing a pool of target nucleic acids comprising RNA transcript(s) of one 
or more target gene(s), or nucleic acids derived from the RNA transcript(s); (b)
hybridizing the nucleic acid sample to a high density array of probes and (c) 
detecting the hybridized nucleic acids and calculating a relative and/or absolute 
expression (transcription, RNA processing or degradation) level. 

(A) Providing a Nucleic Acid Sample

One of skill in the art will appreciate that it is desirable to have nucleic
acid samples containing target nucleic acid sequences that reflect the transcripts of
interest. Therefore, suitable nucleic acid samples may contain transcripts of
interest. Suitable nucleic acid samples, however, may contain nucleic acids
derived from the transcripts of interest. As used herein, a nucleic acid derived 
from a transcript refers to a nucleic acid for whose synthesis the mRNA 
transcript or a subsequence thereof has ultimately served as a template. Thus, 
a cDNA reverse transcribed from a transcript, an RNA transcribed from that 
cDNA, a DNA amplified from the cDNA, an RNA transcribed from the 
amplified DNA, etc., are all derived from the transcript and detection of such
derived products is indicative of the presence and/or abundance of the original 
transcript in a sample. Thus, suitable samples include, but are not limited to, 
transcripts of the gene or genes, cDNA reverse transcribed from the transcript, 
cRNA transcribed from the cDNA, DNA amplified from the genes, RNA 
transcribed from amplified DNA, and the like. 

Transcripts, as used herein, may include, but are not limited to, pre-mRNA nascent
transcript(s), transcript processing intermediates, mature mRNA(s) and
degradation products. It is not necessary to monitor all types of transcripts to
practice this invention. For example, one may choose to practice the invention
to measure the mature mRNA levels only. 

In one embodiment, such sample is a homogenate of cells or tissues or 
other biological samples. Preferably, such sample is a total RNA preparation 
of a biological sample. More preferably in some embodiments, such a nucleic 
acid sample is the total mRNA isolated from a biological sample. Those of skill
in the art will appreciate that the total mRNA prepared with most methods 
includes not only the mature mRNA, but also the RNA processing 
intermediates and nascent pre-mRNA transcripts. For example, total mRNA 
purified with a poly(dT) column contains RNA molecules with poly(A) tails.
Those poly(A)+ RNA molecules could be mature mRNA, RNA processing
intermediates, nascent transcripts or degradation intermediates. 
Biological samples may be of any biological tissue or fluid or cells from any 
organism. Frequently the sample will be a "clinical sample" which is a sample 
derived from a patient. Clinical samples provide a rich source of information 
regarding the various states of genetic network or gene expression. Some 
embodiments of the invention are employed to detect mutations and to identify 
the phenotype of mutations. Such embodiments have extensive applications in 
clinical diagnostics and clinical studies. Typical clinical samples include, but 
are not limited to, sputum, blood, blood cells (e.g., white cells), tissue or fine 
needle biopsy samples, urine, peritoneal fluid, and pleural fluid, or cells 
therefrom. Biological samples may also include sections of tissues, such as 
frozen sections or formalin fixed sections taken for histological purposes. 
Another typical source of biological samples is cell cultures, where gene
expression states can be manipulated to explore the relationship among genes. 
In one aspect of the invention, methods are provided to generate biological 
samples reflecting a wide variety of states of the genetic network. 

One of skill in the art would appreciate that it is desirable to inhibit or 
destroy RNase present in homogenates before homogenates can be used for 
hybridization. Methods of inhibiting or destroying nucleases are well known
in the art. In some preferred embodiments, cells or tissues are homogenized
in the presence of chaotropic agents to inhibit nuclease. In some other
embodiments, RNase is inhibited or destroyed by heat treatment followed by 
proteinase treatment. 

Methods of isolating total mRNA are also well known to those of skill 
in the art. For example, methods of isolation and purification of nucleic acids 
are described in detail in Chapter 3 of Laboratory Techniques in Biochemistry
and Molecular Biology: Hybridization With Nucleic Acid Probes, Part I.
Theory and Nucleic Acid Preparation, P. Tijssen, ed. Elsevier, N.Y. (1993).

In a preferred embodiment, the total RNA is isolated from a given 
sample using, for example, an acid guanidinium-phenol-chloroform extraction 
method and polyA+ mRNA is isolated by oligo(dT) column chromatography 
or by using (dT) on magnetic beads (see, e.g., Sambrook et al., Molecular
Cloning: A Laboratory Manual (2nd ed.), Vols. 1-3, Cold Spring Harbor
Laboratory, (1989), or Current Protocols in Molecular Biology, F. Ausubel
et al., ed. Greene Publishing and Wiley-Interscience, New York (1987)).

Frequently, it is desirable to amplify the nucleic acid sample prior to
hybridization. One of skill in the art will appreciate that whatever amplification 
method is used, if a quantitative result is desired, care must be taken to use a 
method that maintains or controls for the relative frequencies of the amplified 
nucleic acids to achieve quantitative amplification. 

Methods of "quantitative" amplification are well known to those of skill in the 
art. For example, quantitative PCR involves simultaneously co-amplifying a 
known quantity of a control sequence using the same primers. This provides 
an internal standard that may be used to calibrate the PCR reaction. The high 
density array may then include probes specific to the internal standard for 
quantification of the amplified nucleic acid. 

One preferred internal standard is a synthetic AW106 cRNA. The
AW106 cRNA is combined with RNA isolated from the sample according to
standard techniques known to those of skill in the art. The RNA is then
reverse transcribed using a reverse transcriptase to provide copy DNA. The
cDNA sequences are then amplified (e.g., by PCR) using labeled primers. The
amplification products are separated, typically by electrophoresis, and the
amount of radioactivity (proportional to the amount of amplified product) is
determined. The amount of mRNA in the sample is then calculated by
comparison with the signal produced by the known AW106 RNA standard.

Detailed protocols for quantitative PCR are provided in PCR Protocols, A
Guide to Methods and Applications, Innis et al., Academic Press, Inc. N.Y.,
(1990).
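
The comparison with the internal standard can be illustrated as a simple proportion. This is a sketch under stated assumptions: it supposes that signal is linearly proportional to the amount of amplified product and that the target and the standard amplify with the same efficiency. The function name, argument names, and numerical values are hypothetical and are not taken from any cited protocol.

```python
def estimate_target_amount(target_signal: float,
                           standard_signal: float,
                           standard_amount: float) -> float:
    """Estimate the starting amount of target from a co-amplified internal standard.

    Assumes linear signal response and equal amplification efficiency.
    `standard_amount` is the known quantity of standard added to the sample;
    the result is in the same units.
    """
    if standard_signal <= 0:
        raise ValueError("standard signal must be positive")
    return target_signal / standard_signal * standard_amount


# Invented readings: target signal 1500, standard signal 500, 2.0 units of standard added.
amount = estimate_target_amount(1500.0, 500.0, 2.0)
```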

Other suitable amplification methods include, but are not limited to 
polymerase chain reaction (PCR) (Innis, et al., PCR Protocols. A guide to
Methods and Application. Academic Press, Inc. San Diego, (1990)), ligase
chain reaction (LCR) (see Wu and Wallace, Genomics, 4: 560 (1989),
Landegren, et al., Science, 241: 1077 (1988) and Barringer, et al., Gene, 89:
117 (1990)), transcription amplification (Kwoh, et al., Proc. Natl. Acad. Sci.
USA, 86: 1173 (1989)), and self-sustained sequence replication (Guatelli, et al.,
Proc. Natl. Acad. Sci. USA, 87: 1874 (1990)).

Cell lysates or tissue homogenates often contain a number of inhibitors 
of polymerase activity. Therefore, RT-PCR typically incorporates preliminary 
steps to isolate total RNA or mRNA for subsequent use as an amplification 
template. A one-tube mRNA capture method may be used to prepare poly(A)+
RNA samples suitable for immediate RT-PCR in the same tube (Boehringer 
Mannheim). The captured mRNA can be directly subjected to RT-PCR by 
adding a reverse transcription mix and, subsequently, a PCR mix. 

In a particularly preferred embodiment, the sample mRNA is reverse 
transcribed with a reverse transcriptase and a primer consisting of oligo(dT) 
and a sequence encoding the phage T7 promoter to provide single stranded 
DNA template. The second DNA strand is polymerized using a DNA 
polymerase. After synthesis of double-stranded cDNA, T7 RNA polymerase
is added and RNA is transcribed from the cDNA template. Successive rounds
of transcription from each single cDNA template result in amplified RNA.
Methods of in vitro polymerization are well known to those of skill in the art
(see, e.g., Sambrook, supra.) and this particular method is described in detail
by Van Gelder, et al., Proc. Natl. Acad. Sci. USA, 87: 1663-1667 (1990) who
demonstrate that in vitro amplification according to this method preserves the
relative frequencies of the various RNA transcripts. Moreover, Eberwine et al.,
Proc. Natl. Acad. Sci. USA, 89: 3010-3014 provide a protocol that uses two
rounds of amplification via in vitro transcription to achieve greater than 10⁶
fold amplification of the original starting material, thereby permitting 
expression monitoring even where biological samples are limited. 

It will be appreciated by one of skill in the art that the direct 
transcription method described above provides an antisense (aRNA) pool. 
Where antisense RNA is used as the target nucleic acid, the oligonucleotide 
probes provided in the array are chosen to be complementary to subsequences 
of the antisense nucleic acids. Conversely, where the target nucleic acid pool 
is a pool of sense nucleic acids, the oligonucleotide probes are selected to be 
complementary to subsequences of the sense nucleic acids. Finally, where the 
nucleic acid pool is double stranded, the probes may be of either sense as the 
target nucleic acids include both sense and antisense strands. 
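
By way of illustration only, the strand selection described above can be sketched as follows. The function names are hypothetical and are not part of the disclosure; sequences are written 5' to 3' using DNA letters, and the sketch assumes the probe is simply the exact complement of the target subsequence.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def probe_for_target(sense_subseq: str, target_strand: str) -> str:
    """Pick the probe sequence (5'->3') that pairs with the target pool.

    sense_subseq  -- subsequence of the sense (mRNA-like) strand
    target_strand -- "sense" or "antisense": the strand present in the pool
    """
    if target_strand == "antisense":
        # Antisense targets pair with probes carrying the sense sequence.
        return sense_subseq.upper()
    if target_strand == "sense":
        # Sense targets pair with probes carrying the antisense sequence.
        return reverse_complement(sense_subseq)
    raise ValueError("target_strand must be 'sense' or 'antisense'")
```

For a double stranded pool either call yields a usable probe, since both strands are present in the target.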

The protocols cited above include methods of generating pools of either 
sense or antisense nucleic acids. Indeed, one approach can be used to generate 
either sense or antisense nucleic acids as desired. For example, the cDNA can 
be directionally cloned into a vector (e.g., Stratagene's pBluescript II KS (+)
phagemid) such that it is flanked by the T3 and T7 promoters. In vitro 
transcription with the T3 polymerase will produce RNA of one sense (the sense 
depending on the orientation of the insert), while in vitro transcription with the 
T7 polymerase will produce RNA having the opposite sense. Other suitable 
cloning systems include phage lambda vectors designed for Cre-loxP plasmid 
subcloning (see e.g., Palazzolo et al., Gene, 88: 25-36 (1990)).

fn\ xt, a — j j : ~ — 1 . • • t i • « 

1. Probe design

One of skill in the art will appreciate that an enormous number of array 
designs are suitable for the practice of this invention. The high density array 
will typically include a number of probes that specifically hybridize to the
sequences of interest. In addition, in a preferred embodiment, the array will
include one or more control probes. 

The high density array chip includes "test probes." Test probes can
be oligonucleotides that range from about 5 to about 45 or 5 to about 500
nucleotides, more preferably from about 10 to about 40 nucleotides and most
preferably from about 15 to about 40 nucleotides in length. In other
particularly preferred embodiments the probes are 20 or 25 nucleotides in
length. In other preferred embodiments, test probes are double or single
stranded DNA sequences. The DNA sequences are isolated or cloned from natural
sources or amplified from natural sources using natural nucleic acid as
templates. These probes have sequences complementary to particular
subsequences of the genes whose expression they are designed to detect.
Thus, the test probes are capable of specifically hybridizing to the target nucleic
acid they are to detect.

In addition to test probes that bind the target nucleic acid(s) of interest, 
the high density array can contain a number of control probes. The control 
probes fall into three categories referred to herein as 1) normalization controls; 
2) expression level controls; and 3) mismatch controls. 

Normalization controls are oligonucleotide or other nucleic acid probes 
that are complementary to labeled reference oligonucleotides or other nucleic 
acid sequences that are added to the nucleic acid sample. The signals obtained 
from the normalization controls after hybridization provide a control for 
variations in hybridization conditions, label intensity, "reading" efficiency and 
other factors that may cause the signal of a perfect hybridization to vary 
between arrays. In a preferred embodiment, signals (e.g., fluorescence
intensity) read from all other probes in the array are divided by the signal
(e.g., fluorescence intensity) from the control probes, thereby normalizing the
measurements.
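
As a minimal sketch of such a normalization, and not a definitive implementation, the following divides each probe signal by the mean signal of the normalization control probes. The function and argument names are hypothetical, and the use of a simple arithmetic mean over several controls is an assumption made for illustration.

```python
from statistics import mean


def normalize(probe_signals: dict, control_signals: list) -> dict:
    """Divide every probe signal by the mean normalization-control signal.

    probe_signals   -- mapping of probe name to measured intensity
    control_signals -- intensities read from the normalization control probes
    """
    control = mean(control_signals)
    if control <= 0:
        raise ValueError("normalization control signal must be positive")
    return {name: value / control for name, value in probe_signals.items()}
```

Signals normalized in this manner from two arrays can then be compared even where overall hybridization or reading efficiency differed between the arrays.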

Virtually any probe may serve as a normalization control. However, it 
is recognized that hybridization efficiency varies with base composition and 
probe length. Preferred normalization probes are selected to reflect the
average length of the other probes present in the array; however, they can be
selected to cover a range of lengths. The normalization control(s) can also be
selected to reflect the (average) base composition of the other probes in the
array; however, in a preferred embodiment, only one or a few normalization
probes are used and they are selected such that they hybridize well (i.e., no
secondary structure) and do not match any target-specific probes.

Expression level controls are probes that hybridize specifically with
constitutively expressed genes in the biological sample. Virtually any
constitutively expressed gene provides a suitable target for expression level
controls. Typically expression level control probes have sequences
complementary to subsequences of constitutively expressed "housekeeping
genes" including, but not limited to, the β-actin gene, the transferrin receptor
gene, the GAPDH gene, and the like.

Mismatch controls may also be provided for the probes to the target
genes, for expression level controls or for normalization controls. Mismatch
controls are oligonucleotide probes or other nucleic acid probes identical to
their corresponding test or control probes except for the presence of one or
more mismatched bases. A mismatched base is a base selected so that it is not
complementary to the corresponding base in the target sequence to which the
probe would otherwise specifically hybridize. One or more mismatches are
selected such that under appropriate hybridization conditions (e.g., stringent
conditions) the test or control probe would be expected to hybridize with its
target sequence, but the mismatch probe would not hybridize (or would
hybridize to a significantly lesser extent). Preferred mismatch probes contain
a central mismatch. Thus, for example, where a probe is a 20-mer, a
corresponding mismatch probe will have the identical sequence except for a
single base mismatch (e.g., substituting a G, a C or a T for an A) at any of
positions 6 through 14 (the central mismatch).
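
The construction of a mismatch control can be illustrated with the following sketch. The helper names are hypothetical; positions are counted from 1, and the default window of positions 6 through 14 follows the 20-mer example given above.

```python
def mismatch_probe(probe: str, position: int, new_base: str) -> str:
    """Return a copy of `probe` with one base replaced (1-based position)."""
    probe = probe.upper()
    new_base = new_base.upper()
    if not 1 <= position <= len(probe):
        raise ValueError("position outside the probe")
    if len(new_base) != 1 or new_base not in "ACGT":
        raise ValueError("new_base must be one of A, C, G, T")
    if probe[position - 1] == new_base:
        raise ValueError("replacement base must differ from the original")
    return probe[: position - 1] + new_base + probe[position:]


def central_mismatches(probe: str, first: int = 6, last: int = 14) -> list:
    """All single-base mismatch probes at positions first..last inclusive."""
    probe = probe.upper()
    return [
        mismatch_probe(probe, pos, base)
        for pos in range(first, last + 1)
        for base in "ACGT"
        if base != probe[pos - 1]
    ]
```

For a 20-mer the second function yields 27 mismatch probes (nine central positions, three alternative bases each).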

Mismatch probes thus provide a control for non-specific binding or
cross-hybridization to a nucleic acid in the sample other than the target to
which the probe is directed. Mismatch probes thus indicate whether a
hybridization is specific or not. For example, if the target is present the perfect
match probes should be consistently brighter than the mismatch probes. In
addition, if all central mismatches are present, the mismatch probes can be used
to detect a mutation. The difference in intensity between the perfect match and
the mismatch probe (I(PM)-I(MM)) provides a good measure of the
concentration of the hybridized material.
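
A short sketch of how the I(PM)-I(MM) difference might be computed over a set of probe pairs follows. Averaging the differences across the pairs for one gene, and reporting the fraction of pairs in which the perfect match is brighter, are assumptions made here for illustration and are not prescribed by the text above; all names are hypothetical.

```python
def pm_mm_differences(pairs):
    """Return I(PM) - I(MM) for each (perfect match, mismatch) intensity pair."""
    return [pm - mm for pm, mm in pairs]


def average_difference(pairs):
    """Mean of the I(PM) - I(MM) differences over all probe pairs."""
    diffs = pm_mm_differences(pairs)
    return sum(diffs) / len(diffs)


def fraction_pm_brighter(pairs):
    """Fraction of pairs in which the perfect match is brighter than the mismatch."""
    return sum(1 for pm, mm in pairs if pm > mm) / len(pairs)
```

A fraction near one, together with a large average difference, is consistent with specific hybridization of a target that is present in the sample.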

The high density array may also include sample 
preparation/amplification control probes. These are probes that are 
complementary to subsequences of control genes selected because they do not 
normally occur in the nucleic acids of the particular biological sample being 
assayed. Suitable sample preparation/amplification control probes include, for 
example, probes to bacterial genes (e.g., Bio B) where the sample in question
is a biological sample from a eukaryote.

The RNA sample is then spiked with a known amount of the nucleic 
acid to which the sample preparation/amplification control probe is directed 
before processing. Quantification of the hybridization of the sample 
preparation/amplification control probe then provides a measure of alteration 
in the abundance of the nucleic acids caused by processing steps (e.g. PCR, 
reverse transcription, in vitro transcription, etc.). 
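
One way such a measure could be expressed is as a simple ratio, sketched below. The reference value `expected_signal`, i.e., the signal that the same amount of spiked control would give without any loss or gain during processing, is an assumption of this sketch, as are the function and parameter names.

```python
def processing_factor(observed_signal: float, expected_signal: float) -> float:
    """Ratio of observed to expected signal for a spiked control.

    A value below 1 suggests loss of material during processing, and a
    value above 1 suggests a net gain (e.g., through amplification).
    """
    if expected_signal <= 0:
        raise ValueError("expected_signal must be positive")
    return observed_signal / expected_signal
```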

In a preferred embodiment, oligonucleotide probes in the high density 
array are selected to bind specifically to the nucleic acid target to which they 
are directed with minimal non-specific binding or cross-hybridization under the 
particular hybridization conditions utilized. Because the high density arrays of
this invention can contain in excess of 1,000,000 different probes, it is possible
to provide every probe of a characteristic length that binds to a particular
nucleic acid sequence. Thus, for example, the high density array can contain
every possible 20-mer sequence complementary to an IL-2 mRNA.

However, there may exist 20-mer subsequences that are not unique to 
the IL-2 mRNA. Probes directed to these subsequences are expected to cross- 
hybridize with occurrences of their complementary sequence in other regions 
of the sample genome. Similarly, other probes simply may not hybridize
effectively under the hybridization conditions (e.g., due to secondary structure,
or interactions with the substrate or other probes). Thus, in a preferred
embodiment, the probes that show such poor specificity or hybridization
efficiency are identified and may not be included either in the high density array
itself (e.g., during fabrication of the array) or in the post-hybridization data
analysis.
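
The tiling and uniqueness screen described above can be sketched as follows, purely as an illustration. The sketch enumerates every subsequence of a given length in a target sequence and discards those that also occur, as exact matches, elsewhere in the target or in a set of background sequences. The function names and the `background` argument are hypothetical; the sketch does not model secondary structure or near-matches, and each probe would be the complement of a retained subsequence.

```python
def tile_subsequences(target: str, length: int = 20) -> list:
    """Every subsequence of `length` bases in `target`, in order."""
    target = target.upper()
    return [target[i:i + length] for i in range(len(target) - length + 1)]


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of `pattern` in `text`, allowing overlaps."""
    count = 0
    start = 0
    while True:
        idx = text.find(pattern, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1


def unique_probe_sites(target: str, background: list, length: int = 20) -> list:
    """Target subsequences found once in the target and in no background sequence."""
    target = target.upper()
    others = [seq.upper() for seq in background]
    kept = []
    for sub in tile_subsequences(target, length):
        if count_occurrences(target, sub) != 1:
            continue  # repeated within the target itself
        if any(sub in other for other in others):
            continue  # would cross-hybridize with another sequence
        kept.append(sub)
    return kept
```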

In addition, in a preferred embodiment, expression monitoring arrays 
are used to identify the presence and expression (transcription) level of genes 
which are several hundred base pairs long. For most applications it would be 
useful to identify the presence, absence, or expression level of several thousand 
to one hundred thousand genes. Because the number of oligonucleotides per 
array is limited in a preferred embodiment, it is desired to include only a limited 
set of probes specific to each gene whose expression is to be detected. 

As disclosed in U.S. Application Ser. No. 08/772,376, probes as short
as 15, 20, or 25 nucleotides are sufficient to hybridize to a subsequence of a
gene and, for most genes, there is a set of probes that performs well across
a wide range of target nucleic acid concentrations. In a preferred embodiment,
it is desirable to choose a preferred or "optimum" subset of probes for each 
gene before synthesizing the high density array. 

2. Forming High Density Arrays. 

Methods of forming high density arrays of oligonucleotides, peptides 
and other polymer sequences with a minimal number of synthetic steps are 
known. The oligonucleotide analogue array can be synthesized on a solid
substrate by a variety of methods, including, but not limited to, light-directed
chemical coupling, and mechanically directed coupling. See Pirrung et al., U.S.
Patent No. 5,143,854 (see also PCT Application No. WO 90/15070) and
Fodor et al., PCT Publication Nos. WO 92/10092 and WO 93/09668 and U.S.
Ser. No. 07/980,523, which disclose methods of forming vast arrays of
peptides, oligonucleotides and other molecules using, for example,
light-directed synthesis techniques. See also, Fodor et al., Science, 251, 767-77
(1991). These procedures for synthesis of polymer arrays are now referred to 
as VLSIPS™ procedures. Using the VLSIPS™ approach, one heterogeneous 
array of polymers is converted, through simultaneous coupling at a number of 
reaction sites, into a different heterogeneous array. See, U.S. Application 
Serial Nos. 07/796,243 and 07/980,523. 

The development of VLSIPS™ technology as described in the 
above-noted U.S. Patent No. 5,143,854 and PCT patent publication Nos. WO
90/15070 and 92/10092, is considered pioneering technology in the fields of 
combinatorial synthesis and screening of combinatorial libraries. More 
recently, patent application Serial No. 08/082,937, filed June 25, 1993, 
describes methods for making arrays of oligonucleotide probes that can be used 
to check or determine a partial or complete sequence of a target nucleic acid 
and to detect the presence of a nucleic acid containing a specific 
oligonucleotide sequence. 

In brief, the light-directed combinatorial synthesis of oligonucleotide 
arrays on a glass surface proceeds using automated phosphoramidite chemistry 
and chip masking techniques. In one specific implementation, a glass surface 
is derivatized with a silane reagent containing a functional group, e.g., a 
hydroxyl or amine group blocked by a photolabile protecting group. Photolysis 
through a photolithographic mask is used selectively to expose functional
groups which are then ready to react with incoming 5'-photoprotected 
nucleoside phosphoramidites. The phosphoramidites react only with those sites 
which are illuminated (and thus exposed by removal of the photolabile blocking 
group). Thus, the phosphoramidites only add to those areas selectively
exposed from the preceding step. These steps are repeated until the desired
array of sequences has been synthesized on the solid surface. Combinatorial
synthesis of different oligonucleotide analogues at different locations on the
array is determined by the pattern of illumination during synthesis and the order
of addition of coupling reagents.

In the event that an oligonucleotide analogue with a polyamide
backbone is used in the VLSIPS™ procedure, it is generally inappropriate to
use phosphoramidite chemistry to perform the synthetic steps, since the
monomers do not attach to one another via a phosphate linkage. Instead,
peptide synthetic methods are substituted. See, e.g., Pirrung et al., U.S. Pat.
No. 5,143,854.

Peptide nucleic acids are commercially available from, e.g., Biosearch, 
Inc. (Bedford, MA) which comprise a polyamide backbone and the bases found 
in naturally occurring nucleosides. Peptide nucleic acids are capable of binding 
to nucleic acids with high specificity, and are considered "oligonucleotide 
analogues" for purposes of this disclosure. 

In addition to the foregoing, additional methods which can be used to 
generate an array of oligonucleotides on a single substrate are described in co- 
pending Applications Ser. No. 07/980,523, filed November 20, 1992, and 
07/796,243, filed November 22, 1991 and in PCT Publication No. WO 
93/09668. In the methods disclosed in these applications, reagents are 
delivered to the substrate by either (1) flowing within a channel defined on 
predefined regions or (2) "spotting" on predefined regions or (3) through the 
use of photoresist. However, other approaches, as well as combinations of 
spotting and flowing, may be employed. In each instance, certain activated 
regions of the substrate are mechanically separated from other regions when 
the monomer solutions are delivered to the various reaction sites. 

A typical "flow channel" method applied to the compounds and libraries 
of the present invention can generally be described as follows. Diverse polymer 
sequences are synthesized at selected regions of a substrate or solid support by
forming flow channels on a surface of the substrate through which appropriate
reagents flow or in which appropriate reagents are placed. For example,
assume a monomer "A" is to be bound to the substrate in a first group 
of selected regions. If necessary, all or part of the surface of the substrate in 
all or a part of the selected regions is activated for binding by, for example, 
flowing appropriate reagents through all or some of the channels, or by
washing the entire substrate with appropriate reagents. After placement of a
channel block on the surface of the substrate, a reagent having the monomer 
A flows through or is placed in all or some of the channel(s). The channels 
provide fluid contact to the first selected regions, thereby binding the monomer 
A on the substrate directly or indirectly (via a spacer) in the first selected 
regions. 

Thereafter, a monomer B is coupled to second selected regions, some 
of which may be included among the first selected regions. The second 
selected regions will be in fluid contact with a second flow channel(s) through 
translation, rotation, or replacement of the channel block on the surface of the 
substrate; through opening or closing a selected valve; or through deposition 
of a layer of chemical or photoresist. If necessary, a step is performed for 
activating at least the second regions. Thereafter, the monomer B is flowed 
through or placed in the second flow channel(s), binding monomer B at the 
second selected locations. In this particular example, the resulting sequences 
bound to the substrate at this stage of processing will be, for example, A, B, 
and AB. The process is repeated to form a vast array of sequences of desired 
length at known locations on the substrate. 
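
The two coupling steps of this example can be modeled with a short sketch, offered as an illustration only. Regions are represented by hypothetical names (`r1` through `r4`), and each coupling step simply appends a monomer to whatever sequence is already bound in the selected regions.

```python
def couple(regions: dict, selected: set, monomer: str) -> None:
    """Append `monomer` to the sequence bound in each selected region."""
    for region in selected:
        regions[region] = regions.get(region, "") + monomer


def two_step_example() -> dict:
    """Couple A to a first group of regions, then B to an overlapping second group."""
    regions = {name: "" for name in ("r1", "r2", "r3", "r4")}
    couple(regions, {"r1", "r2"}, "A")  # first selected regions
    couple(regions, {"r2", "r3"}, "B")  # second selected regions, overlapping at r2
    return regions
```

After the two steps the regions carry A, AB and B respectively, with the unselected region left empty, matching the sequences listed above.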

After the substrate is activated, monomer A can be flowed through 
some of the channels, monomer B can be flowed through other channels, a 
monomer C can be flowed through still other channels, etc. In this manner, 
many or all of the reaction regions are reacted with a monomer before the 
channel block must be moved or the substrate must be washed and/or 
reactivated. By making use of many or all of the available reaction regions 
simultaneously, the number of washing and activation steps can be minimized.

One of skill in the art will recognize that there are alternative methods of
forming channels or otherwise protecting a portion of the surface of the
substrate. For example, according to some embodiments, a protective coating 
such as a hydrophilic or hydrophobic coating (depending upon the nature of the 
solvent) is utilized over portions of the substrate to be protected, sometimes 
in combination with materials that facilitate wetting by the reactant solution in
other regions. In this manner, the flowing solutions are further prevented from 
passing outside of their designated flow paths. 

High density nucleic acid arrays can be fabricated by depositing 
presynthesized or natural nucleic acids in predefined positions. Synthesized or
natural nucleic acids are deposited on specific locations of a substrate by light 
directed targeting and oligonucleotide directed targeting. Nucleic acids can 
also be directed to specific locations in much the same manner as the flow 
channel methods. For example, a nucleic acid A can be delivered to and 
coupled with a first group of reaction regions which have been appropriately 
activated. Thereafter, a nucleic acid B can be delivered to and reacted with a 
second group of activated reaction regions. Nucleic acids are deposited in 
selected regions. Another embodiment uses a dispenser that moves from region
to region to deposit nucleic acids in specific spots. Typical dispensers include 
a micropipette or capillary pin to deliver nucleic acid to the substrate and a 
robotic system to control the position of the micropipette with respect to the 
substrate. In other embodiments, the dispenser includes a series of tubes, a 
manifold, an array of pipettes or capillary pins, or the like so that various 
reagents can be delivered to the reaction regions simultaneously. 

3. Hybridization 

Nucleic acid hybridization simply involves contacting a probe and target 
nucleic acid under conditions where the probe and its complementary target 
can form stable hybrid duplexes through complementary base pairing. The 
nucleic acids that do not form hybrid duplexes are then washed away leaving
the hybridized nucleic acids to be detected, typically through detection of an
attached detectable label. It is generally recognized that nucleic acids are
denatured by increasing the temperature or decreasing the salt concentration
of the buffer containing the nucleic acids. Under low stringency conditions
(e.g., low temperature and/or high salt) hybrid duplexes (e.g., DNA:DNA,
RNA:RNA, or RNA:DNA) will form even where the annealed sequences are
not perfectly complementary. Thus specificity of hybridization is reduced at
lower stringency. Conversely, at higher stringency (e.g., higher temperature
or lower salt) successful hybridization requires fewer mismatches.

One of skill in the art will appreciate that hybridization conditions may 
be selected to provide any degree of stringency. In a preferred embodiment, 
hybridization is performed at low stringency, in this case in 6X SSPE-T at
37°C (0.005% Triton X-100), to ensure hybridization and then subsequent washes
are performed at higher stringency (e.g., 1X SSPE-T at 37°C) to eliminate
mismatched hybrid duplexes. Successive washes may be performed at
increasingly higher stringency (e.g., down to as low as 0.25X SSPE-T at
37°C to 50°C) until a desired level of hybridization specificity is obtained.
Stringency can also be increased by addition of agents such as formamide.
Hybridization specificity may be evaluated by comparison of hybridization to 
the test probes with hybridization to the various controls that can be present 
(e.g., expression level control, normalization control, mismatch controls, etc.). 

In general, there is a tradeoff between hybridization specificity 
(stringency) and signal intensity. Thus, in a preferred embodiment, the wash 
is performed at the highest stringency that produces consistent results and that 
provides a signal intensity greater than approximately 10% of the background 
intensity. Thus, in a preferred embodiment, the hybridized array may be 
washed at successively higher stringency solutions and read between each 
wash. Analysis of the data sets thus produced will reveal a wash stringency 
above which the hybridization pattern is not appreciably altered and which 
provides adequate signal for the particular oligonucleotide probes of interest. 



In a preferred embodiment, background signal is reduced by the use of a
detergent (e.g., C-TAB) or a blocking reagent (e.g., sperm DNA, cot-1 DNA,
etc.) during the hybridization to reduce non-specific binding. In a particularly
preferred embodiment, the hybridization is performed in the presence of about
0.5 mg/ml DNA (e.g., herring sperm DNA). The use of blocking agents in
hybridization is well known to those of skill in the art (see, e.g., Chapter 8 in
P. Tijssen, supra.).

The stability of duplexes formed between RNAs or DNAs is generally
in the order of RNA:RNA > RNA:DNA > DNA:DNA, in solution.
Long probes have better duplex stability with a target, but poorer mismatch 
discrimination than shorter probes (mismatch discrimination refers to the 
measured hybridization signal ratio between a perfect match probe and a single 
base mismatch probe). Shorter probes (e.g., 8-mers) discriminate mismatches 
very well, but the overall duplex stability is low. 

Altering the thermal stability (Tm) of the duplex formed between the
target and the probe using, e.g., known oligonucleotide analogues allows for
optimization of duplex stability and mismatch discrimination. One useful
aspect of altering the Tm arises from the fact that adenine-thymine (A-T)
duplexes have a lower Tm than guanine-cytosine (G-C) duplexes, due in part
to the fact that the A-T duplexes have 2 hydrogen bonds per base pair, while
the G-C duplexes have 3 hydrogen bonds per base pair. In heterogeneous
oligonucleotide arrays in which there is a non-uniform distribution of bases, it
is not generally possible to optimize hybridization for each oligonucleotide
probe simultaneously. Thus, in some embodiments, it is desirable to selectively
destabilize G-C duplexes and/or to increase the stability of A-T duplexes. This
can be accomplished, e.g., by substituting guanine residues in the probes of an
array which form G-C duplexes with hypoxanthine, or by substituting adenine
residues in probes which form A-T duplexes with 2,6 diaminopurine or by
using the salt tetramethyl ammonium chloride (TMACl) in place of NaCl.
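
The effect of base composition on duplex stability can be illustrated with the widely used Wallace rule of thumb for short oligonucleotides, Tm ≈ 2°C x (A+T) + 4°C x (G+C). This rule is not taken from the present disclosure; it is a rough approximation for unmodified DNA probes in standard salt and is used here only to show why probes of equal length but different composition cannot all be at their optimum under one set of conditions.

```python
def wallace_tm(seq: str) -> int:
    """Approximate Tm in degrees C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G and T")
    return 2 * at + 4 * gc
```

Under this approximation an all A-T 20-mer and an all G-C 20-mer differ by 40°C in estimated Tm, which is the kind of spread that the substitutions described above are intended to narrow.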

Altered duplex stability conferred by using oligonucleotide analogue
probes can be ascertained by following, e.g., fluorescence signal intensity of
oligonucleotide analogue arrays hybridized with a target oligonucleotide over
time. The data allow optimization of specific hybridization conditions at, e.g., 
room temperature (for simplified diagnostic applications in the future). 
Another way of verifying altered duplex stability is by following the signal 
intensity generated upon hybridization with time. Previous experiments using
DNA targets and DNA chips have shown that signal intensity increases with 
time, and that the more stable duplexes generate higher signal intensities faster 
than less stable duplexes. The signals reach a plateau or "saturate" after a 
certain amount of time due to all of the binding sites becoming occupied. 
These data allow for optimization of hybridization, and determination of the
best conditions at a specified temperature.

Methods of optimizing hybridization conditions are well known to
those of skill in the art (see, e.g., Laboratory Techniques in Biochemistry and
Molecular Biology, Vol. 24: Hybridization With Nucleic Acid Probes, P.
Tijssen, ed. Elsevier, N.Y., (1993)).


(C) Signal Detection 

In a preferred embodiment, the hybridized nucleic acids are detected by 
detecting one or more labels attached to the sample nucleic acids. The labels 
may be incorporated by any of a number of means well known to those of skill 
in the art. However, in a preferred embodiment, the label is simultaneously
incorporated during the amplification step in the preparation of the sample
nucleic acids. Thus, for example, polymerase chain reaction (PCR) with
labeled primers or labeled nucleotides will provide a labeled amplification
product. In a preferred embodiment, transcription amplification, as described
above, using a labeled nucleotide (e.g., fluorescein-labeled UTP and/or CTP)
incorporates a label into the transcribed nucleic acids.

Alternatively, a label may be added directly to the original nucleic acid
sample (e.g., mRNA, polyA mRNA, cDNA, etc.) or to the amplification
product after the amplification is completed. Means of attaching labels to
nucleic acids are well known to those of skill in the art and include, for example,
nick translation or end-labeling (e.g., with a labeled RNA) by kinasing of the
nucleic acid and subsequent attachment (ligation) of a nucleic acid linker
joining the sample nucleic acid to a label (e.g., a fluorophore).

Detectable labels suitable for use in the present invention include any
composition detectable by spectroscopic, photochemical, biochemical,
immunochemical, electrical, optical or chemical means. Useful labels in the
present invention include biotin for staining with labeled streptavidin conjugate,
magnetic beads (e.g., Dynabeads™), fluorescent dyes (e.g., fluorescein, Texas
red, rhodamine, green fluorescent protein, and the like), radiolabels (e.g., 3H,
125I, 35S, 14C, or 32P), enzymes (e.g., horseradish peroxidase, alkaline
phosphatase and others commonly used in an ELISA), and colorimetric labels
such as colloidal gold or colored glass or plastic (e.g., polystyrene,
polypropylene, latex, etc.) beads. Patents teaching the use of such labels
include U.S. Patent Nos. 3,817,837; 3,850,752; 3,939,350; 3,996,345;
4,277,437; 4,275,149; and 4,366,241.

Means of detecting such labels are well known to those of skill in the 
art. Thus, for example, radiolabels may be detected using photographic film 
or scintillation counters, fluorescent markers may be detected using a 
photodetector to detect emitted light. Enzymatic labels are typically detected 
by providing the enzyme with a substrate and detecting the reaction product 
produced by the action of the enzyme on the substrate, and colorimetric labels 
are detected by simply visualizing the colored label. One particularly preferred
method uses a colloidal gold label that can be detected by measuring scattered
light.

The label may be added to the target (sample) nucleic acid(s) prior to, 
or after the hybridization. So called "direct labels" are detectable labels that are 
directly attached to or incorporated into the target (sample) nucleic acid prior 
to hybridization. In contrast, so called "indirect labels" are joined to the hybrid
duplex after hybridization. Often, the indirect label is attached to a binding
moiety that has been attached to the target nucleic acid prior to the
hybridization. Thus, for example, the target nucleic acid may be biotinylated
before the hybridization. After hybridization, an avidin-conjugated fluorophore
will bind the biotin bearing hybrid duplexes providing a label that is easily
detected. For a detailed review of methods of labeling nucleic acids and
detecting labeled hybridized nucleic acids see Laboratory Techniques in
Biochemistry and Molecular Biology, Vol. 24: Hybridization With Nucleic
Acid Probes, P. Tijssen, ed. Elsevier, N.Y., (1993).

Fluorescent labels are preferred and easily added during an in vitro 
transcription reaction. In a preferred embodiment, fluorescein labeled UTP and 
CTP are incorporated into the RNA produced in an in vitro transcription 
reaction as described above. 

Means of detecting labeled target (sample) nucleic acids hybridized to 
the probes of the high density array are known to those of skill in the art. 
Thus, for example, where a colorimetric label is used, simple visualization of 
the label is sufficient. Where a radioactive labeled probe is used, detection of 
the radiation (e.g., with photographic film or a solid state detector) is sufficient.
In a preferred embodiment, however, the target nucleic acids are labeled with 
a fluorescent label and the localization of the label on the probe array is 
accomplished with fluorescent microscopy. The hybridized array is excited 
with a light source at the excitation wavelength of the particular fluorescent 
label and the resulting fluorescence at the emission wavelength is detected. In 
a particularly preferred embodiment, the excitation light source is a laser 
appropriate for the excitation of the fluorescent label. 

The confocal microscope may be automated with a computer- 
controlled stage to automatically scan the entire high density array. Similarly, 
the microscope may be equipped with a phototransducer (e.g., a
photomultiplier, a solid state array, a CCD camera, etc.) attached to an 
automated data acquisition system to automatically record the fluorescence 
signal produced by hybridization to each oligonucleotide probe on the array.
Such automated systems are described in PCT Application WO 92/10092 and
copending U.S. Application Ser. No. 08/195,889 filed on February 10, 1994.
Use of laser illumination in conjunction with automated confocal microscopy
for signal detection permits detection at a resolution of better than about
100 µm, more preferably better than about 50 µm, and most preferably better
than about 25 µm.

One of skill in the art will appreciate that methods for evaluating the
hybridization results vary with the nature of the specific probe nucleic acids 
used as well as the controls provided. In the simplest embodiment, simple 
quantification of the fluorescence intensity for each probe is determined. This 
is accomplished simply by measuring probe signal strength at each location 
(representing a different probe) on the high density array (e.g., where the label 
is a fluorescent label, detection of the amount of fluorescence (intensity)
produced by a fixed excitation illumination at each location on the array). 
Comparison of the absolute intensities of an array hybridized to nucleic acids 
from a "test" sample with intensities produced by a "control" sample provides 
a measure of the relative expression of the nucleic acids that hybridize to each 
of the probes. 

One of skill in the art, however, will appreciate that hybridization 
signals will vary in strength with efficiency of hybridization, the amount of label 
on the sample nucleic acid and the amount of the particular nucleic acid in the 
sample. Typically nucleic acids present at very low levels (e.g., < 1 pM) will 
show a very weak signal. At some low level of concentration, the signal 
becomes virtually indistinguishable from background. In evaluating the 
hybridization data, a threshold intensity value may be selected below which a 
signal is not counted, being treated as essentially indistinguishable from 
background. 
Where it is desirable to detect nucleic acids expressed at lower levels, a lower 
threshold is chosen. Conversely, where only high expression levels are to be 
evaluated a higher threshold level is selected. In a preferred embodiment, a 
suitable threshold is about 10% above that of the average background signal. 
In addition, the provision of appropriate controls permits a more detailed 
analysis that accounts for non-specific binding and the like. Thus, for 
example, in a preferred embodiment, the hybridization array is provided with 
normalization controls. These normalization controls are probes 
complementary to control sequences added in a known concentration to the 
sample. Where the overall hybridization conditions are poor, the 
normalization controls will show a smaller signal reflecting reduced 
hybridization. Conversely, where hybridization conditions 
are good, the normalization controls will provide a higher signal reflecting the 
improved hybridization. Normalization of the signal derived from other probes 
in the array to the normalization controls thus provides a control for variations 
in hybridization conditions. Typically, normalization is accomplished by 
dividing the measured signal from the other probes in the array by the average 
signal produced by the normalization controls. Normalization may also include 
correction for variations due to sample preparation and amplification. Such 
normalization may be accomplished by dividing the measured signal by the 
average signal from the sample preparation/amplification control probes (e.g., 
the Bio B probes). The resulting values may be multiplied by a constant value 
to scale the results. 
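
By way of illustration only, the thresholding and normalization steps 
described above can be sketched as follows. This is a minimal sketch under 
stated assumptions, not a disclosed implementation: the function names 
`filter_signals` and `normalize`, the sample intensities, and the reading of 
"about 10% above" the average background as a factor of 1.10 are all 
hypothetical. 

```python
from statistics import mean

def filter_signals(signals, background, margin=0.10):
    """Drop probe signals at or below a threshold set `margin` above mean background.

    `margin=0.10` is an assumed reading of "about 10% above" the background.
    """
    cutoff = mean(background) * (1.0 + margin)
    return {probe: s for probe, s in signals.items() if s > cutoff}

def normalize(signals, control_signals, scale=1.0):
    """Divide each probe signal by the mean normalization-control signal,
    then multiply by a constant `scale` factor."""
    control_mean = mean(control_signals)
    if control_mean <= 0:
        raise ValueError("normalization controls must have a positive mean signal")
    return {probe: scale * s / control_mean for probe, s in signals.items()}

# Hypothetical intensities in arbitrary units.
background = [100.0, 110.0, 90.0]
raw = {"probe_a": 105.0, "probe_b": 500.0, "probe_c": 50.0}
counted = filter_signals(raw, background)
normalized = normalize(counted, control_signals=[200.0, 300.0], scale=100.0)
```

A second division by the average signal of the sample 
preparation/amplification controls could be applied in the same way by 
calling `normalize` again with those control signals. 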

As indicated above, the high density array can include mismatch 
controls. In a preferred embodiment, there is a mismatch control having a 
central mismatch for every probe (except the normalization controls) in the 
array. It is expected that after washing in stringent conditions, where a perfect 
match would be expected to hybridize to the probe, but not to the mismatch, 
the signal from the mismatch controls should only reflect non-specific binding 
or the presence in the sample of a nucleic acid that hybridizes with the 
mismatch. Where the probe in question and its corresponding mismatch 
control both show high signals, or the mismatch shows a higher signal than its 
corresponding test probe, there is a problem with the hybridization and the 
signal from those probes is ignored. The difference in hybridization signal 
intensity between the target-specific probe and its corresponding mismatch 
control is a measure of the discrimination of the target-specific probe. Thus, 
the signal of the mismatch control is subtracted from the signal from its 
corresponding test probe to provide a measure of the signal due to specific 
binding of the test probe. 

The concentration of a particular sequence can then be determined by 
measuring the signal intensity of each of the probes that bind specifically to that 
gene and normalizing to the normalization controls. Where the signal from the 
probes is greater than the mismatch, the mismatch is subtracted. Where the 
mismatch intensity is equal to or greater than its corresponding test probe, the 
signal is ignored. The expression level of a particular gene can then be scored 
by the number of positive signals (either absolute or above a threshold value), 
the intensity of the positive signals (either absolute or above a selected 
threshold value), or a combination of both metrics (e.g., a weighted average). 
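
By way of illustration only, the mismatch subtraction and scoring rules of 
this paragraph can be sketched as follows. The names `specific_signals` and 
`score_gene`, the example intensities, and the choice to report the count and 
the mean intensity side by side (rather than any particular weighted 
combination) are assumptions made for the sketch. 

```python
def specific_signals(pairs):
    """Return PM - MM for each (pm, mm) pair where PM exceeds MM.

    Pairs whose mismatch signal equals or exceeds the perfect match signal
    are ignored, as described in the text.
    """
    return [pm - mm for pm, mm in pairs if pm > mm]

def score_gene(pairs, threshold=0.0):
    """Summarize a gene by the count and mean intensity of positive signals."""
    positives = [d for d in specific_signals(pairs) if d > threshold]
    count = len(positives)
    mean_intensity = sum(positives) / count if count else 0.0
    return {"positive_count": count, "mean_intensity": mean_intensity}

# Hypothetical normalized (perfect match, mismatch) intensities for one gene.
pairs = [(300.0, 100.0), (120.0, 150.0), (80.0, 80.0), (500.0, 200.0)]
score = score_gene(pairs)
```

Here the second and third pairs are ignored because the mismatch signal is 
equal to or greater than that of the test probe. 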

In some preferred embodiments, a computer system is used to compare 
the hybridization intensities of the perfect match and mismatch probes of each 
pair. If the gene is expressed, the hybridization intensity (or affinity) of a 
perfect match probe of a pair should be recognizably higher than the 
corresponding mismatch probe. Generally, if the hybridization intensities of 
a pair of probes are substantially the same, it may indicate the gene is not 
expressed. However, the determination is not based on a single pair of probes; 
whether a gene is expressed is determined from an analysis of 
many pairs of probes. 

After the system compares the hybridization intensity of the perfect 
match and mismatch probes, the system indicates expression of the gene. 
As an example, the system may indicate to a user that the gene is either 
present (expressed), marginal or absent (unexpressed). Specific procedures 
for data analysis are disclosed in U.S. Application 08/772,376, previously 
incorporated for all purposes. 

In addition to high density nucleic acid arrays, other methods are 
also useful for massive gene expression monitoring. Differential display, 
described by Liang, P. and Pardee, A.B. (Differential Display of eukaryotic 
messenger RNA by means of the polymerase chain reaction. Science 

provides a useful means for distinguishing gene expression between two 
samples. Serial analysis of gene expression, described by Velculescu et al. 
(Serial Analysis of Gene Expression. Science, 270:484-487, 1995, 
incorporated herein by reference for all purposes) provides another method 
for quantitative and qualitative analysis of gene expression. Optical fiber 
oligonucleotide sensors, described by Ferguson et al. (A Fiber-optic DNA 
biosensor microarray for the analysis of gene expression. Nature 
Biotechnology 14:1681-1684, 1996), can also be used for gene expression 
monitoring. 

The following terminology is relevant to the use of gene arrays for 
determining levels of expression of particular genes. 

Massive Parallel Screening: The phrase "massively parallel screening" 
refers to the simultaneous screening of at least about 100, preferably about 
1000, more preferably about 10,000 and most preferably about 1,000,000 
different nucleic acid hybridizations. 

Mismatch control: The term "mismatch control" or "mismatch probe" refers 
to a probe whose sequence is deliberately selected not to be perfectly 
complementary to a particular target sequence. For each mismatch (MM) 
control in a high-density array there typically exists a corresponding perfect 
match (PM) probe that is perfectly complementary to the same particular 
target sequence. The mismatch may comprise one or more bases. While the 
mismatch(es) may be located anywhere in the mismatch probe, terminal 
mismatches are less desirable as a terminal mismatch is less likely to prevent 
hybridization of the target sequence. In a particularly preferred 
embodiment, the mismatch is located at or near the center of the probe such 
that the mismatch is most likely to destabilize the duplex with the target 
sequence under the test hybridization conditions. 

mRNA or transcript: The term "mRNA" refers to transcripts of a gene. 
Transcripts include, for example, mature messenger RNA ready 
for translation and products of various stages of transcript processing. 
Transcript processing may include splicing, editing and degradation. 

Perfect match probe: The term "perfect match probe" refers to a probe 
that has a sequence that is perfectly complementary to a particular target 
sequence. The test probe is typically perfectly complementary to a portion 
(subsequence) of the target sequence. The perfect match (PM) probe can be 
a "test probe", a "normalization control" probe, an expression level control 
probe and the like. A perfect match control or perfect match probe is, 
however, distinguished from a "mismatch control" or "mismatch probe." 

Quantifying: The term "quantifying" when used in the context of 
quantifying transcription levels of a gene can refer to absolute or to relative 
quantification. Absolute quantification may be accomplished by inclusion of 
known concentration(s) of one or more target nucleic acids (e.g., control 
nucleic acids such as Bio B, or known amounts of the target nucleic acids 
themselves) and referencing the hybridization intensity of unknowns with the 
known target nucleic acids (e.g., through generation of a standard curve). 
Alternatively, relative quantification can be accomplished by comparison of 
hybridization signals between two or more genes, or between two or more 
treatments to quantify the changes in hybridization intensity and, by 
implication, transcription level. 
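
By way of illustration only, absolute quantification against a standard 
curve can be sketched as follows. The sketch assumes, hypothetically, that 
hybridization intensity is linear in concentration over the range of the 
standards; the function names, the least-squares fit, and the example values 
are not taken from this disclosure. 

```python
def fit_standard_curve(concentrations, intensities):
    """Least-squares line: intensity = slope * concentration + intercept."""
    n = len(concentrations)
    if n < 2 or n != len(intensities):
        raise ValueError("need at least two matched (concentration, intensity) points")
    mean_x = sum(concentrations) / n
    mean_y = sum(intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    if sxx == 0:
        raise ValueError("standards must span more than one concentration")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, intensities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def estimate_concentration(intensity, slope, intercept):
    """Invert the fitted line to read a concentration off the standard curve."""
    return (intensity - intercept) / slope

# Hypothetical spiked-in controls: concentration (pM) and measured intensity.
slope, intercept = fit_standard_curve([1.0, 2.0, 4.0], [110.0, 210.0, 410.0])
unknown = estimate_concentration(310.0, slope, intercept)
```

Relative quantification needs no such curve; it reduces to comparing the 
intensities (for example, as a ratio) between two genes or two treatments. 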

Up-stream or down-stream gene: If the expression of a first gene is 
regulated by a second gene, the second gene is called an "up-stream gene" 
for the first gene and the first gene is the "down-stream" gene of the second 
gene. The regulation of the first gene by the second gene could be through 
trans-activation. For example, the second gene encodes a transcriptional factor 
that controls the expression of the first gene. Alternatively, regulation 
can be by inhibition of transcript degradation. Regulation can also be by 
inhibition of translation of a transcript. Still other modes of regulation are 

It is understood that the examples and embodiments described herein 
are for illustrative purposes only and that various modifications or changes 
in light thereof will be suggested to persons skilled in the art and are to be 
included within the spirit and purview of this application and scope of the 
appended claims. Many variations of the invention will be apparent to those 
of skill in the art upon reviewing the above description. By way of example, 
the invention has been described primarily with reference to the use of a high 
density oligonucleotide array, but it will be readily recognized by those of 
skill in the art that other nucleic acid arrays, other methods of measuring 
transcript levels and gene expression monitoring at the protein level could 
be used. The scope of the invention should, therefore, be determined not 
with reference to the above description, but should instead be determined 
with reference to the appended claims, along with the full scope of 
equivalents to which such claims are entitled. All publications, patents, and 
patent applications cited herein are hereby incorporated by reference for all 
purposes. 

EXAMPLES 

Example 1. Ectopic expression of a transgene 

One application of the present method involves ectopically 
expressing 50-100 oncogenes and tumor suppressor genes in a "shotgun" 
approach to identify novel relationships among proteins. Many of the genes 
that have been causally linked to cancer development will likely cluster in a 
limited number of key cellular pathways. Several examples exist in which 
genes mutated in disparate contexts turn out to play roles in common 
pathways. Perhaps the most striking case involves a signaling network that 
regulates the cell cycle. At least seven components of this network (ATM, 
p53, MDM2, p16, cdk4, cyclin D1, and RB) exhibit frequent alterations in 
tumors, underscoring the role this network plays in tumor formation. 

One model system involves the ATM, p53, and p21 proteins. These 
three proteins act in a DNA damage-response growth regulatory axis in 
which ATM stimulates p53 activity, and p53 transcriptionally activates p21. 
However, the exact biochemical relationships among these molecules and 
as-yet unidentified pathway members remain unclear. Primary embryonic 
fibroblasts from wild-type mice, as well as from mice individually null for 
ATM, p53, and p21 can be used. Cells of each genotype can be 
gamma-irradiated in order to activate the DNA damage-response pathway, 
and expression profiles can be generated from irradiated (at 1 and 4 hours 
post-irradiation) and non-irradiated cells. 

Two other model systems involve ectopic expression of transgenes 
in cultured cell lines. U2OS osteosarcoma cells stably transfected with 
tetracycline-repressible WT1 (Wilms' tumor susceptibility gene) or BRCA1 
(breast cancer susceptibility gene) expression constructs can also be used. 
In this so-called "tet-off" induction system, the recombinant gene is induced 
by withdrawal of tetracycline from the tissue culture medium. Upon 
induction of WT1 and BRCA1 expression, we identified 18 and 16 
endogenous genes, respectively (of 7000 genes monitored), that displayed 
reproducible expression changes of 3-fold or greater. None of the candidate 
functional mediators (CFMs) identified in the WT1 expression screen 
overlapped with those identified in the BRCA1 screen, indicating that these 
expression changes were not artifactually linked to the induction system or 
host cell type employed in the screening. Literature searches indicated that 
roughly half of the CFMs identified in each study had either correlative or 
causative links to tumorigenesis, and several have been proposed, or are in 
clinical use, as prognostic or diagnostic markers. These findings support the 
physiological relevance of expression changes identified in cell culture-based 
recombinant expression systems. Putative mechanistic advantages of 
BRCA1 inactivation in tumors can be inferred from the identity of its CFMs 
in the profiling experiments. 

An iterative expression profiling strategy can be used in which CFMs 
identified in an initial screen are themselves recombinantly expressed in 
subsequent screens. For example, using the tet-off induction system, we can 
ectopically express one WT1 CFM (amphiregulin) and one BRCA1 CFM 
(GADD45) following stable transfection in U2OS cells. Amphiregulin and 
GADD45 transfectants can be profiled pre- and post-induction. The 
resultant expression changes constitute a subset of those observed following 
BRCA1 and WT1 induction. 

Amphiregulin and GADD45 have been expressed in U2OS cells in 
transient transfection experiments, and the resulting phenotype (reduced 
colony formation, in both cases) was identical to that obtained following 
transient transfection of either BRCA1 or WT1 in these same cells. 
Furthermore, in in situ hybridization experiments WT1 and amphiregulin 
co-localized to developing human glomeruli during identical stages of 
embryogenesis. Thus amphiregulin and GADD45 are implicated as 
transducers of WT1 and BRCA1 tumor suppressive function. Other data 
are consistent with this finding. For example, treatment of several human 
carcinoma cell lines with soluble amphiregulin (a member of the EGF family) 
results in growth inhibition. The BRCA1 and GADD45 proteins have been 
correlatively linked to DNA damage response. BRCA1 binds to Rad51, a 
homolog of bacterial recA involved in DNA repair following exposure to 
ionizing radiation. Moreover, treatment of cell lines with ionizing radiation 
induces alterations in BRCA1 phosphorylation and subnuclear localization, 
as well as GADD45 transcription. Lastly, GADD45 binds PCNA, a 
component of DNA replication and repair complexes, and enhances 
nucleotide excision repair in vitro. Together, the above data suggest that 
GADD45 functions downstream of BRCA1, and that amphiregulin functions 
downstream of WT1. 

The amphiregulin and GADD45 expression constructs can be 
screened using custom polymorphism-detection arrays or by standard 
gel-based sequencing methods to ensure that no mutations were introduced 
during PCR amplification. Expression vectors can be stably transfected into 
U2OS cells, and clones expressing high levels of the recombinant genes can 
be identified by hybridization of RNA from those cell lines to these same 
polymorphism-detection arrays (using different software, 
polymorphism-detection arrays can also be used for expression monitoring). 
Following subsequent preparative-scale inductions, RNA can be isolated, 
biotin-labeled, and hybridized to oligonucleotide arrays. The chips can then 
be washed, treated with streptavidin-coupled phycoerythrin (to link a 
fluorophore to the biotinylated RNA), and scanned with a laser confocal 
fluorescence microscope. Fluorescence intensities can be used to calculate 
absolute mRNA abundances, as well as expression differences from sample 
to sample. RNA derived from the ATM-null, p53-null, p21-null, and 
wildtype murine cells can be treated in the same fashion. 

Example 2. Hybridization-Based Assay for Generating Expression Profiles 
Messenger RNA levels are determined by hybridization of complete 
mRNA populations to sets of arrays containing hundreds of thousands of 
chemically synthesized oligonucleotides. The oligonucleotides are 
synthesized in situ on glass supports using light-directed, solid-phase 
combinatorial chemistry. Because the arrays are designed and synthesized 
based on sequence information alone, they provide a direct link between 
genomic sequence and measurements of differential gene expression. Each 
synthesis feature consists of more than 10^7 copies of a particular 
oligonucleotide. 

For each mRNA sample, the expression levels of thousands of 
full-length human genes can be monitored. For each gene, up to 20 
complementary oligomers are chosen based on automated selection criteria. 
The criteria include tests for sequence uniqueness relative to the rest of the 
genome and the absence of sequence features (e.g., 
clusters of single nucleotides) that have been determined to adversely affect 
hybridization behavior on arrays. The use of sets of oligonucleotides for 
each gene provides redundancy in the detection and analysis of the data, 
mitigates the potentially confounding effects of occasional 
cross-hybridization, and removes the requirement that all oligonucleotides 
hybridize identically in order to obtain quantitative information. To further 
increase the sensitivity and specificity of detection, each complementary 
oligonucleotide (perfect match, or PM) is synthesized with a closely related 
mismatch (MM) partner in a physically adjacent position. The mismatch 
partner is identical except for a single base difference at the central position 
of the oligomer. The MM oligonucleotide of each pair serves as an internal 
control that allows consistent hybridization patterns (patterns of PM signals 
that are larger than the corresponding MM signals) to be recognized. 
Quantitative image analysis is based on the average of the differences 
between the PM and MM partners, so that nonspecific and background 
contributions tend to cancel, while specific hybridization signals tend to add 
constructively across the set of oligonucleotide pairs for each gene. These 
hybridization signals are quantitative over three orders of magnitude, from 
1:300,000 to 1:300. 
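
By way of illustration only, the average of the PM/MM differences used in 
the quantitative image analysis can be sketched as follows. The function 
names and the example intensities are hypothetical, and the sketch omits 
whatever outlier handling an actual analysis package might apply. 

```python
def average_difference(pairs):
    """Mean of PM - MM over all probe pairs for one gene.

    Every pair contributes, so nonspecific and background signal common to
    PM and MM tends to cancel in the average.
    """
    if not pairs:
        raise ValueError("at least one probe pair is required")
    return sum(pm - mm for pm, mm in pairs) / len(pairs)

def fraction_pm_greater(pairs):
    """Fraction of pairs in which the PM signal is larger than the MM signal."""
    if not pairs:
        raise ValueError("at least one probe pair is required")
    return sum(1 for pm, mm in pairs if pm > mm) / len(pairs)

# Hypothetical (PM, MM) intensities for three probe pairs of one gene.
pairs = [(300.0, 100.0), (120.0, 150.0), (500.0, 200.0)]
avg = average_difference(pairs)
frac = fraction_pm_greater(pairs)
```

A high fraction of pairs with PM greater than MM corresponds to the 
consistent hybridization pattern described above. 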



Example 3. Computational Modeling 

Computational models can be tailored to the analysis of simple 
experimental systems. They lend themselves well to visual representation of 
signaling networks. They provide a convenient mechanism to facilitate 
communication between biologists and computer scientists. 

We have chosen a set of reductionist experimental systems that can 
yield data about the connectivity of these networks. Given connectivity data 
for such networks, very simple combinatorial models allow us to deduce 
significant properties of these networks. Such reverse-engineering 
procedures require that the experimental data on which they are based can 

The systems described in the previous sections display the common 
feature of an initiating genetic alteration resulting in expression effects on 
other genes. We have devised a simple model (exemplified in Figure 2) that 
examines the connectivity of a signaling network. We can examine this 
model, first in an idealized case, and then adding more details obtained from 
experimentation. In our model, the network is represented using a "graph," 
genes correspond to "nodes," and directional signals between pairs of genes 
correspond to "edges." The true graph reflects the actual relationships 
among the nodes (example in Fig 2a). Either a gene, A, affects another gene, 
B, directly (the edge exists), and transmits change, or gene B is not directly 
affected by gene A, in which case the edge does not exist. Note that 
"direct" in this case means without involvement of other intermediary genes. 

In contrast to most other models, we represent the experimentally 
observed states of genes in a relative sense as "changed" or "unchanged", 
rather than in an absolute sense as "expressed" or "unexpressed", since we 
are interested in the flow of signals along pathways. In this model, the 
genes in our graph possess only two states, "0", standing for "unchanged 
with respect to a control", and "1", standing for "significantly changed with 
respect to a control." Data from an experiment are reduced to a collection of 
genes, each either in state 0 or 1. Experimental induction or deletion of a 
gene, A, can then produce expression changes in a set of genes (Example 
2b). The genes in this set are exactly those genes that have paths leading to 
them from A (those nodes connected, directly or indirectly, to node A in the 
true graph). Furthermore, if a gene B has a path leading to it from gene A, 
then there is a path leading from A to every gene connected to B. In this 
case, the set of genes associated with induction of gene B is a subset of the 
genes associated with induction of gene A, and gene B has a path leading to 
it from gene A. 
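
By way of illustration only, the relationship between paths in the true 
graph and the sets of changed genes can be sketched as follows. The 
four-gene graph and the function name `affected_genes` are hypothetical; in 
practice the true graph is unknown and only the sets of changed genes are 
observed. 

```python
def affected_genes(graph, source):
    """Genes expected in state 1 when `source` is perturbed.

    `graph` maps each gene to the genes it affects directly. The result is
    every gene reachable from `source` by a directed path.
    """
    seen = set()
    stack = list(graph.get(source, ()))
    while stack:
        gene = stack.pop()
        if gene not in seen:
            seen.add(gene)
            stack.extend(graph.get(gene, ()))
    return seen

# Hypothetical true graph: A -> B, B -> C, B -> D.
true_graph = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}
set_a = affected_genes(true_graph, "A")
set_b = affected_genes(true_graph, "B")
b_is_downstream_of_a = "B" in set_a and set_b <= set_a
```

Because B lies on a path from A, every gene changed by perturbing B is also 
changed by perturbing A, which is the subset relation stated above. 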

In our two ectopic expression systems, we have already determined 
the genes whose expression changes following WT1 and BRCA1 
induction. We can generate expression profiles following amphiregulin and 
GADD45 induction and assess whether the affected genes are a subset of 
those altered by WT1 and BRCA1 induction, respectively. Conversely, in 
the gene knockout model, we expect to observe expression subsets 
reflecting the absence of altered expression following DNA damage (as 
compared to the response of wildtype cells to DNA damage). For example, 
deletion of p53 should result in failure to produce expression changes in a 
set of genes; the affected genes should be a subset of those that fail to 
change in expression following DNA damage in ATM-null cells. 

There are some combinatorial relationships that may not be captured 
by gene-induction systems or gene-deletion systems alone. For example, if 
upregulation of both gene A and gene B is necessary to induce gene C, then a 
change in gene A alone will not necessarily affect gene C. These 
relationships could however be assessed using a system in which the 
expression of genes A and B can be individually suppressed (as in a gene 
knockout). In this case, down-regulation of either gene A or gene B 
(assuming that they are expressed at the beginning of the experiment) can 
affect gene C. Mathematically, we can easily capture "or" relationships, but 
not "and" relationships. However, as we just described, we can always 
represent an "and" relationship as an "or" relationship: A and B = not ((not 
A) or (not B)). Thus, combining experimental systems to enable both 
up-regulation and down-regulation is necessary to assess such relationships. 

In an ideal case, we obtain connectivity information for each gene. 
In this case we can obtain an exact representation of the transitive closure of 
the true graph. This transitive closure is that graph in which, for any two 
nodes A and B, there is an edge from A to B if and only if there is a path 
from A to B in the true graph G (Example 2c). The connectivity data table in 
this case is exactly the adjacency matrix for the transitive closure graph. 
There is an edge A->B between two genes A and B in the transitive closure 
graph if gene B resides in state 1 in an experiment in which gene A is 
perturbed. Distinct graphs can share the same 
transitive closure, and hence experiments of this type cannot distinguish the 
true graph from other graphs consistent with the data (Figures 2d and 2e). 
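
By way of illustration only, the transitive closure and the resulting 
ambiguity can be sketched as follows. Warshall's algorithm is used here as 
one standard way to compute a transitive closure; the function name and the 
two three-gene graphs are hypothetical. 

```python
def transitive_closure(genes, edges):
    """Map each gene to the set of genes reachable from it (Warshall's algorithm)."""
    reach = {g: set() for g in genes}
    for a, b in edges:
        reach[a].add(b)
    for k in genes:
        for i in genes:
            # If i reaches k, then i also reaches everything k reaches.
            if k in reach[i]:
                reach[i] |= reach[k]
    return reach

genes = ["A", "B", "C"]

# Two different candidate graphs: a chain, and the chain plus a direct A -> C edge.
chain = [("A", "B"), ("B", "C")]
chain_with_shortcut = [("A", "B"), ("B", "C"), ("A", "C")]

closure_1 = transitive_closure(genes, chain)
closure_2 = transitive_closure(genes, chain_with_shortcut)
indistinguishable = closure_1 == closure_2
```

Both candidate graphs predict the same connectivity table, so perturbation 
experiments of this type cannot tell whether the direct edge from A to C 
exists. 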

A useful minimal structure for analytical purposes is the "condensed 
graph." Any transitive closure graph decomposes into two types of 
components: strongly connected components (SCCs), in which every node 
is connected to every other, and a directed acyclic component connecting 
the SCCs. A condensed graph is produced by reducing each SCC to a 
corresponding "super" node. This procedure leaves a directed, acyclic graph 
between such nodes. Edges in this acyclic graph may be required by the 
experimental data, or redundant, and it is useful to mark them one way or 
another. 
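
By way of illustration only, the condensed graph can be derived from a 
transitive closure as follows. The sketch relies on the property that, in a 
transitive closure, two genes lie in the same SCC exactly when each has an 
edge to the other; the function name `condense` and the example closure are 
hypothetical. 

```python
def condense(closure):
    """Collapse each strongly connected component into one "super" node.

    `closure` maps each gene to the set of genes reachable from it and must
    be transitively closed, with every gene present as a key. Returns a dict
    from each component (a frozenset of genes) to the components it reaches.
    """
    component = {}
    for g, targets in closure.items():
        # g's component: g itself plus every gene that g reaches and that reaches g.
        members = {g} | {h for h in targets if g in closure[h]}
        component[g] = frozenset(members)
    dag = {c: set() for c in set(component.values())}
    for g, targets in closure.items():
        for h in targets:
            if component[g] != component[h]:
                dag[component[g]].add(component[h])
    return dag

# Hypothetical closure of the graph A -> B, B -> A, B -> C.
closure = {"A": {"A", "B", "C"}, "B": {"A", "B", "C"}, "C": set()}
condensed = condense(closure)
```

Here A and B collapse into a single super node with one edge to the node 
containing C. 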

There are several well-known algorithms for producing instances of 
minimal graphs having the same transitive closure as a given graph. These 
algorithms are, in general, computationally infeasible for large graphs, but 
are feasible for the transitive closure graphs discussed here. Every strongly 
connected component may be represented as a cycle on its nodes, allowing 
redundant edges to be removed from the remaining graph, leaving a minimal 
structure summarizing the properties of the graph. Such a minimal structure 
may be a useful visualization tool in summarizing the data set, although it 
may not resemble the true graph. Due to this lack of resemblance, such a 
structure cannot be taken as a model of the true graph. 
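
By way of illustration only, one way to build such a minimal structure is 
sketched below: redundant edges are removed from the acyclic part, and each 
SCC is drawn as a single cycle. This is a simple procedure chosen for the 
sketch and not necessarily any of the well-known algorithms referred to 
above; the function names and example data are hypothetical. 

```python
def reduce_dag(closure):
    """Remove redundant edges from the transitive closure of an acyclic graph.

    An edge a -> c is redundant when some b with a -> b also has b -> c,
    because c is then still reachable from a through b.
    """
    reduced = {a: set(targets) for a, targets in closure.items()}
    for a, targets in closure.items():
        for b in targets:
            reduced[a] -= closure[b]
    return reduced

def cycle_edges(members):
    """Represent one SCC as a single directed cycle over its (sorted) members."""
    ordered = sorted(members)
    if len(ordered) < 2:
        return []
    return [(ordered[i], ordered[(i + 1) % len(ordered)])
            for i in range(len(ordered))]

# Hypothetical closure of the chain A -> B -> C -> D.
closure = {"A": {"B", "C", "D"}, "B": {"C", "D"}, "C": {"D"}, "D": set()}
minimal = reduce_dag(closure)
ring = cycle_edges({"X", "Y", "Z"})
```

The six edges of the closure reduce to the three edges of the chain, and a 
three-gene SCC is summarized by three edges rather than six. 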

This simple model of connectivity allows us to map the major 
features of gene regulatory pathways. Even in the most ideal case, obtaining 
the exact signaling network is infeasible, although much information about 
the properties of the network can be obtained. Representing the network as 
a graph allows us to exploit standard combinatorial algorithms. See, for 
example, Martello (1982), Khuller (1995), and van Leeuwen (1990), each of 
which is expressly incorporated herein. 

The above disclosure generally describes the present invention. A 
more complete understanding can be obtained by reference to the following 
specific examples, which are provided herein for purposes of illustration 
only, and are not intended to limit the scope of the invention. 